Add trim_galore test fixtures (for MultiQC#3538) #377

Merged
ewels merged 1 commit into MultiQC:main from FelixKrueger:add-trim-galore-fixtures
May 11, 2026

@FelixKrueger FelixKrueger commented Apr 27, 2026

Companion PR to MultiQC/MultiQC#3538, which adds a native MultiQC module for Trim Galore v2.x.

This PR drops the test fixtures the new module needs into data/modules/trim_galore/. Real Trim Galore v2.1.0-beta.5 JSON outputs:

  • sample_R1.fastq.gz_trimming_report.json — single-end (10K Illumina, long adapter-length distribution including the 1-bp tail typical for the default --stringency 1)
  • BS-seq_10K_R{1,2}.fastq.gz_trimming_report.json — paired-end pair (BS-seq 10K, with pair_validation populated and short adapter-length tails — covers the PE code path)

Schema reference: schema_version: 1, documented in the upstream MultiQC issue thread.

The MultiQC PR's test_modules_run.py::test_all_modules[trim_galore-…] and test_ignore_samples[trim_galore-…] checks fail until this PR merges (they look for test-data/data/modules/trim_galore/). Happy to coordinate merge order — most natural is to merge this first, then unblock the MultiQC PR's CI.

Test fixtures for the new MultiQC `trim_galore` module proposed in
MultiQC/MultiQC#3538. These are real Trim Galore v2.1.0-beta.5 outputs:

- sample_R1.fastq.gz_trimming_report.json — single-end (10K Illumina,
  long adapter-length distribution including the 1-bp tail typical for
  --stringency 1 default)
- BS-seq_10K_R{1,2}.fastq.gz_trimming_report.json — paired-end pair
  (BS-seq 10K, with `pair_validation` populated and short adapter-length
  tails — useful coverage for the PE code path)

Schema reference: schema_version 1, documented in the upstream issue
thread at MultiQC/MultiQC#3529.
FelixKrueger added a commit to FelixKrueger/MultiQC that referenced this pull request Apr 27, 2026
The fixtures originally landed in `tests/data/modules/trim_galore/` of
the main repo, but `test_modules_run.py` resolves test data via
`<repo>/test-data/data/modules/<module>/` (a separate sibling repo,
MultiQC/test-data). Moving them there in MultiQC/test-data#377.

@ewels ewels left a comment

These are still up to date post-release, right?

Nit: Please could you include the associated log files as well? I want to make sure that we don't show sections for both TrimGalore and Cutadapt together, so it would help to test for that blocking effect.

@ewels ewels merged commit cd690cb into MultiQC:main May 11, 2026
ewels added a commit to MultiQC/MultiQC that referenced this pull request May 11, 2026
* Add native MultiQC module for Trim Galore v2.x (Oxidized Edition)

Closes #3529.

Trim Galore v2.x emits a structured `*_trimming_report.json` (schema v1)
alongside the legacy `*_trimming_report.txt` report. The text report
still carries the `"This is cutadapt"` shim for backwards compatibility,
so the existing `cutadapt` module path keeps working unchanged. This new
module parses the JSON natively, which:

- Gets the Software Versions table right ("Trim Galore X.Y.Z" instead of
  the misleading "Cutadapt 4.0" backwards-compat shim)
- Surfaces TrimGalore-specific stats not available from Cutadapt output
  (RRBS truncation counts, poly-A/G trimming, paired-end pair-validation
  outcomes — the latter two are wired through to the data file but not
  yet plotted; happy to add follow-up sections)

## What's plotted

- General stats columns: % adapter, % pass, % q-trimmed, total reads
  (hidden), total bp written (hidden)
- Filtered reads bargraph: passing / too_short / too_long / too_many_n
  / discarded_untrimmed
- Adapter length distribution linegraph (per sample, per adapter when a
  sample has more than one)

## Sample-name handling

PE TrimGalore reports list both R1 and R2 in `input_filenames` (both
JSONs do — Trim Galore preserves the pair context). The parser uses the
JSON's `read_number` field to pick the correct filename, so R1 and R2
become distinct samples.
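
A minimal sketch of that selection, not the module's actual code. It assumes `read_number` is 1-based and indexes into `input_filenames`, and that single-end reports either omit it or set it to 1; the function name is hypothetical.

```python
from typing import Any, Dict


def pick_input_filename(report: Dict[str, Any]) -> str:
    """Return the input filename that this report describes.

    Assumption: `read_number` is 1-based and lines up with the order of
    `input_filenames` (R1 first, R2 second). A missing value is treated as 1.
    """
    filenames = report["input_filenames"]
    read_number = int(report.get("read_number") or 1)
    return filenames[read_number - 1]
```

With this, the R1 and R2 reports of one pair resolve to different filenames even though both list the same `input_filenames`.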

## Coexistence with the cutadapt module

Both modules will discover their respective files (text vs JSON). With
both enabled, each sample appears in both modules' general-stats
columns. Users who want to disable the cutadapt path on TrimGalore
samples can:

```yaml
disable_modules:
  - cutadapt
```

Documented in the module's class docstring.

## Test fixtures

`tests/data/modules/trim_galore/` contains:
- `sample_R1.fastq.gz_trimming_report.json` — SE example (10K Illumina)
- `BS-seq_10K_R{1,2}.fastq.gz_trimming_report.json` — PE example (BS-seq
  10K, with `pair_validation` populated)

Verified locally: `multiqc -m trim_galore tests/data/modules/trim_galore/`
produces 3/3 reports parsed (1 SE + 2 PE), all sections rendered, data
file written to `multiqc_data/multiqc_trim_galore.txt`.

## Schema reference

JSON schema v1 is documented in the upstream issue thread (linked in
the issue body). The parser version-gates on `schema_version: 1` and
warns + skips files with a different version, so a future schema bump
won't silently misparse.
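
The version gate could look roughly like the sketch below. It uses only the standard library; `load_report` and `SUPPORTED_SCHEMA_VERSION` are illustrative names, not identifiers from the module.

```python
import json
import logging
from typing import Any, Dict, Optional

log = logging.getLogger(__name__)

# Hypothetical constant: the only schema version this sketch accepts.
SUPPORTED_SCHEMA_VERSION = 1


def load_report(raw: str, source: str) -> Optional[Dict[str, Any]]:
    """Parse a JSON report; return None (after a warning) on a version mismatch."""
    data = json.loads(raw)
    version = data.get("schema_version")
    if version != SUPPORTED_SCHEMA_VERSION:
        log.warning("Skipping %s: unsupported schema_version %r", source, version)
        return None
    return data
```

Returning `None` lets the caller skip the file without treating the rest of the run as failed.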

## Status

Marking as draft. Initial scope is intentionally focused — happy to
extend for poly-A/G trimming sections, pair-validation visualisation,
RRBS-specific stats, or anything else the maintainers want before a
final review pass.

* Apply prettier + ruff format from prek hooks

* Move test fixtures to MultiQC/test-data fork (companion PR)

The fixtures originally landed in `tests/data/modules/trim_galore/` of
the main repo, but `test_modules_run.py` resolves test data via
`<repo>/test-data/data/modules/<module>/` (a separate sibling repo,
MultiQC/test-data). Moving them there in MultiQC/test-data#377.

* Address PR review feedback on trim_galore module

- Move write_data_file to end of __init__ and flatten payload to scalar
  columns so multiqc_trim_galore.txt is machine-readable
- Call add_software_version unconditionally; bail on ignored samples
  inside the parse loop
- Drop module-level docstring and unicode divider comments per project
  style; tone down class docstring
- Drop redundant _strip_fastq_suffix helper in favour of clean_s_name
- Add SampleGroupingConfig so PE pairs collapse cleanly with
  table_sample_merge (weighted-average percentages, sum counts)
- Remove hardcoded bargraph colours; use uniform composite keys in the
  adapter-length plot and continue on zero-adapter samples
- Drop % Q-trim precision to {:,.1f}; surface tg_total_reads by default
- Bump schema_version mismatch to log.error with explicit guidance
- Simplify search_patterns.yaml (drop contents/num_lines shim)
- Group trim_galore adjacent to cutadapt in config_defaults.yaml
- Revert CHANGELOG.md entry (generated from PR titles)
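
For reference, the merge arithmetic named above (weighted-average percentages, summed counts) can be sketched in plain Python. This does not use the SampleGroupingConfig API; the column names `total_reads` and `pct_adapter` are placeholders for illustration.

```python
from typing import Dict, List


def merge_pair(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Collapse per-read rows into one row for the pair.

    Counts are summed; the percentage is averaged with each row weighted
    by its read count, so a larger file contributes proportionally more.
    """
    total_reads = sum(row["total_reads"] for row in rows)
    if total_reads:
        weighted = sum(row["pct_adapter"] * row["total_reads"] for row in rows)
        pct_adapter = weighted / total_reads
    else:
        # No reads at all: avoid dividing by zero.
        pct_adapter = 0.0
    return {"total_reads": total_reads, "pct_adapter": pct_adapter}
```

A plain mean of the two percentages would only agree with this when both reads have the same count.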

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Log debug message when JSON tool field is not Trim Galore

Helps diagnostics if a non-TrimGalore JSON happens to match the
filename glob.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Auto-suppress cutadapt module for Trim Galore v2.x text reports

The cutadapt module's text-report pattern matches any file containing
"This is cutadapt", which also catches the backwards-compatibility shim
that Trim Galore v2.x writes alongside its native JSON report. Result:
every v2.x sample shows up twice — once via cutadapt (as a misleading
"Cutadapt 4.0"), once via trim_galore. Telling users to disable cutadapt
globally also kills parsing of genuine cutadapt logs and legacy
Trim Galore v0/v1 reports, so it isn't a real fix.

Add an exclude_contents_re to the cutadapt text-report pattern matching
"Trim Galore version: " followed by a major version of 2 or higher.
v0.x / v1.x text reports continue to be picked up by cutadapt; v2.x
text reports are skipped (the sibling JSON is handled by trim_galore);
pure cutadapt logs are unaffected.
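
To illustrate the intended matching, here is one possible regex checked with Python's `re` module. The pattern is an assumption about how "major version of 2 or higher" might be written, not a copy of the entry in search_patterns.yaml.

```python
import re

# Hypothetical pattern: a single-digit major version 2-9, or any
# multi-digit major version, followed by the dot before the minor version.
V2_PLUS = re.compile(r"Trim Galore version: (?:[2-9]|[1-9]\d+)\.")


def cutadapt_should_skip(text: str) -> bool:
    """True when the text report was written by Trim Galore v2.x or later."""
    return V2_PLUS.search(text) is not None
```

Requiring the trailing dot keeps a `1.x` header from matching on its first digit alone.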

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add Pair Validation, Poly-A/G, and RRBS sections to trim_galore

Surface the schema-v1 fields that were already in the data file but not
plotted: pair_validation, poly_a_trimming, poly_g_trimming, rrbs. Each is
a small table with sensible gating.

- Pair Validation: collapses R1/R2 (pair-level data is identical between
  them), drops rows where less than 0.1% of pairs were affected.
- Poly-A/G and RRBS: per-row gating, samples with zero counts are hidden.
- All three sections show a Bootstrap alert listing dropped samples, with
  long lists wrapped in <details> (bases2fastq pattern).
- Defensive try/except around length_distribution int-coercion so a
  malformed key downgrades to a debug log rather than crashing the run.
- Data file flattening extended to all of pair_validation,
  poly_a_trimming, poly_g_trimming and rrbs blocks (25 columns total).
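
A rough sketch of the Pair Validation gating, assuming each row carries a total and an affected pair count. The keys `pairs_analyzed` and `pairs_removed` are placeholders, not schema-v1 field names.

```python
from typing import Dict, List, Tuple

MIN_AFFECTED_FRACTION = 0.001  # 0.1% of pairs

Rows = Dict[str, Dict[str, int]]


def gate_pair_rows(rows: Rows) -> Tuple[Rows, List[str]]:
    """Split rows into those worth showing and the names of dropped samples."""
    kept: Rows = {}
    dropped: List[str] = []
    for sample, row in rows.items():
        total = row["pairs_analyzed"]
        affected = row["pairs_removed"]
        if total > 0 and affected / total >= MIN_AFFECTED_FRACTION:
            kept[sample] = row
        else:
            dropped.append(sample)
    return kept, dropped
```

The `dropped` list is what an alert listing hidden samples would be built from.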

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add support for sample grouping

* Add explicit_groups for deterministic tool-derived sample grouping

Extend SampleGroupingConfig with an `explicit_groups` parameter that
lets modules supply their own ground-truth groups instead of relying on
the user's `table_sample_merge` name patterns. Useful when the tool
output already tells you which samples are related — paired-end
trimmers that emit both filenames in each report, lane manifests,
replicate IDs, etc. The framework silently ignores entries with a
single member so callers don't need to filter them out themselves.

Wire trim_galore to use this. Each JSON's `tuple(input_filenames)` is
a stable pair key (byte-identical between R1 and R2 of the same pair).
Auto-grouping applies to:

- General Stats table — framework path with expand-to-see-individuals
- Pair Validation table — manual collapse keyed on the same pair_key

Filtered Reads bargraph, Poly-A/G and RRBS tables stay per-read because
R1 and R2 stats there can legitimately differ. Users with
`table_sample_merge` configured layer name-pattern grouping on top of
the auto-derived pairs. The `trim_galore_config.auto_group_pairs: false`
flag opts out of auto-grouping entirely.
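
The pair-key derivation can be sketched as below. The exact shape that `explicit_groups` expects is not shown here, so the return type is an assumption and `derive_pair_groups` is a hypothetical name. The sketch filters single-member groups itself only to make the net effect visible; per the description above, the framework would ignore them anyway.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List, Tuple

PairKey = Tuple[str, ...]


def derive_pair_groups(reports: Dict[str, Dict[str, Any]]) -> Dict[PairKey, List[str]]:
    """Group sample names by the tuple of input filenames in each report."""
    by_key: DefaultDict[PairKey, List[str]] = defaultdict(list)
    for sample, report in reports.items():
        # R1 and R2 of one pair list the same filenames, so they share a key.
        by_key[tuple(report["input_filenames"])].append(sample)
    # Single-end reports form one-member groups and are left out.
    return {key: sorted(members) for key, members in by_key.items() if len(members) > 1}
```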

Replaces the earlier `_apply_grouping` helper that relied on
`config.table_sample_merge` to pre-aggregate filtered_reads / poly /
RRBS — those now stay per-read regardless of grouping config.

Docs updated: developer guide gets a worked example for module authors
with authoritative pair info; user-facing customisation page describes
the auto-grouping behaviour and the opt-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Simplify trim_galore module after review pass

- Store pair_key_to_samples and pair_display_by_key on `self` so
  _pair_validation_plot and _derive_auto_groups drop their extra
  parameters
- Add a small _add_filtered_section helper that wraps the
  plot+description+alert+add_section pattern, collapsing three
  near-identical 11-line blocks
- Simplify the gen_stats type annotation from a quadruple Union
  workaround to `Dict[str, Dict[ColumnKey, Any]]` plus a single
  `cast(Any, ...)` at the addcols call site
- Drop the unused `Union` import that fell out of the above
- Tighten narrative comments per CLAUDE.md (keep only WHY)

Net: -30 lines, no behaviour change. Lint / mypy / module tests all
clean across the three grouping scenarios (default, with
table_sample_merge, and opt-out).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bit of clean up

* Manual review of docs

* Schema version: assume semver, only throw error on major version bump

* Remove some excessively cautious code

* Make code way less defensive still.

If the data is that badly mangled, I'd rather it throw an exception instead of silently default to fake numbers

* Better docstring / docs

* Tidy up descriptions / helptext a bit

---------

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>